Comparing Fifty Natural Languages and Twelve Genetic Languages Using Word Embedding Language Divergence (WELD) as a Quantitative Measure of Language Distance

Author: Asgari Ehsaneddin
Mofrad Mohammad R. K.
Publication date: 28/04/2016

We introduce a new measure of distance between languages based on word embeddings, called word embedding language divergence (WELD). WELD is defined as the divergence between the unified similarity distributions of words across languages. Using this measure, we perform language comparison for fifty natural languages and twelve genetic languages. Our natural language dataset is a collection of sentence-aligned parallel corpora from Bible translations for fifty languages spanning a variety of language families. Although we use parallel corpora, which guarantee the same content in all languages, in many cases languages within the same family cluster together. In addition to natural languages, we perform language comparison for the coding regions in the genomes of 12 different organisms (4 plants, 6 animals, and two human subjects). Our results confirm a significant high-level difference in the genetic language model of humans/animals versus plants. The proposed method is a step toward defining a quantitative measure of similarity between languages, with applications in language classification, genre identification, dialect identification, and evaluation of translations.
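
As a rough illustration of the idea in the abstract above, the following Python sketch compares two languages by the divergence between their word-pair similarity distributions. It is a minimal sketch under stated assumptions, not the authors' implementation: the toy embeddings, the function names, the clipping of negative cosine similarities to zero, and the choice of Jensen-Shannon divergence are all assumptions made for illustration. The abstract itself only states that WELD is a divergence between similarity distributions of words.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_distribution(embeddings, concepts):
    """Normalized similarities over all pairs of aligned concepts.

    Negative similarities are clipped to zero (an assumption) so that the
    values can be normalized into a probability distribution.
    """
    sims = [max(cosine(embeddings[a], embeddings[b]), 0.0)
            for a, b in combinations(concepts, 2)]
    total = sum(sims)
    if total == 0:
        raise ValueError("all pairwise similarities are zero")
    return [s / total for s in sims]


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions."""
    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def language_divergence(emb_a, emb_b, concepts):
    """Divergence between the similarity distributions of two languages."""
    p = similarity_distribution(emb_a, concepts)
    q = similarity_distribution(emb_b, concepts)
    return js_divergence(p, q)


# Hypothetical toy embeddings, keyed by concept ids shared via word alignment.
concepts = ["water", "fire", "stone"]
lang_a = {"water": [1.0, 0.1], "fire": [0.2, 1.0], "stone": [0.9, 0.3]}
lang_b = {"water": [0.1, 1.0], "fire": [1.0, 0.2], "stone": [0.5, 0.5]}

print(language_divergence(lang_a, lang_b, concepts))
```

With base-2 logarithms the Jensen-Shannon divergence is bounded between 0 and 1, so pairwise values of this kind could be collected into a distance matrix and clustered. Whether that matches the paper's exact procedure is not stated in the abstract.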

arXiv.org e-Print Archive

eScholarship - University of California

The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens

Author: Alborzi Seyed Ziaeddin
Altenhoff Adrian
Amezola Miguel
Antczak Magdalena
Aridhi Sabeur
Asgari Ehsaneddin
Atalay Volkan
Babbitt Patricia C.
Barot Meet
Ben-Hur Asa
Benso Alfredo
Bergquist Timothy R.
Berselli Michele
Bhat Prajwal
Björne Jari
Black Gage S.
Boecker Florian
Bonneau Richard
Borukhov Itamar
Bosco Giovanni
Boudellioua Imane
Brackenridge Danielle A.
Brenner Steven E.
Cao Renzhi
Carraro Marco
Casadio Rita
Cetin-Atalay Rengul
Chandler Caleb
Chang Jia-Ming
Cheng Jianlin
Chi Po-Han
Cozzetto Domenico
Crocker Alex W.
Dai Suyang
Dalkiran Alperen
Das Sayoni
Davidović Radoslav S.
Davis Larry
Dayton Jonathan B.
Dessimoz Christophe
Devignes Marie-Dominique
Di Carlo Stefano
Dogan Tunca
Dzeroski Saso
Koo Da Chen Emily
Fa Rui
Fabris Fabio
Falda Marco
Fang Hai
Fernández José M.
Fontana Paolo
Frank Yotam
Frasca Marco
Freddolino Peter L.
Freitas Alex A.
Friedberg Iddo
Gemovic Branislava
Georghiou George
Ginter Filip
Gligorijević Vladimir
Goldberg Tatyana
Gough Julian
Greene Casey S.
Grossi Giuliano
Hakala Kai
Hamid Md Nafiz
Hoehndorf Robert
Hogan Deborah A.
Holm Liisa
Hou Jie
Hurto Rebecca L.
Jain Aashish
Jeffery Constance J.
Jiang Yuxiang
Jo Dane
Johnson Devon
Jones David T.
Kacsoh Balint Z.
Kaewphan Suwisa
Kahanda Indika
Kihara Daisuke
Kulmanov Maxat
Larsen Dallas J.
Lavezzo Enrico
Lee Alexandra J.
Lees Jonathan Gill
Lewis Kimberley A.
Liao Wen-Hung
Lichtarge Olivier
Linial Michal
Liu Yi-Wei
Mao Qizhong
Martelli Pier Luigi
Martin Maria J.
McGuffin Liam
McHardy Alice C.
Medlar Alan J.
Mehryary Farrokh
Mesiti Marco
Moen Hans
Mofrad Mohammad R. K.
Mooney Sean D.
Nguyen Huy N.
Notaro Marco
Novikov Ilya
Omdahl Ashton R.
Orengo Christine A.
O’Donovan Claire
Paccanaro Alberto
Pascarelli Stefano
Perovic Vladimir R.
Petrini Alessandro
Piovesan Damiano
Politano Gianfranco
Profiti Giuseppe
Radivojac Predrag
Re Matteo
Reeb Jonas
Rehman Hafeez Ur
Renaux Alexandre
Rifaioglu Ahmet S.
Ritchie David W.
Roche Daniel B.
Rodriguez Jose Manuel
Romero Alfonso E.
Rose Peter W.
Rost Burkhard
Sagers Luke W.
Saidi Rabie
Salakoski Tapio
Savojardo Castrense
Sillitoe Ian
Suh Erica
Sumonja Neven
Supek Fran
Thurlby Natalie
Tian Weidong
Tolvanen Martti E. E.
Toppo Stefano
Torres Mateo
Tosatto Silvio C. E.
Tress Michael L.
Tseng Wei-Cheng
Törönen Petri
Valentini Giorgio
Veljkovic Nevena
Vesztrocy Alex Warwick
Vidulin Vedrana
Vucetic Slobodan
Wan Cen
Wang Zheng
Wass Mark N.
Wilkins Angela
Yang Haixuan
Yao Shuwei
You Ronghui
Yunes Jeffrey M.
Zhang Chengxin
Zhang Feng
Zhang Shanshan
Zhang Yang
Zhang Zihan
Zhao Chenguang
Zhou Naihui
Zhu Shanfeng
Zosa Elaine
Šmuc Tomislav
Publication date: 01/01/2019

Background: The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function. Results: Here, we report on the results of the third CAFA challenge, CAFA3, that featured an expanded analysis over the previous CAFA rounds, both in terms of volume of data analyzed and the types of analysis performed. In a novel and major new development, computational predictions and assessment goals drove some of the experimental assays, resulting in new functional annotations for more than 1000 genes. Specifically, we performed experimental whole-genome mutation screening in Candida albicans and Pseudomonas aeruginosa genomes, which provided us with genome-wide experimental data for genes associated with biofilm formation and motility. We further performed targeted assays on selected genes in Drosophila melanogaster, which we suspected of being involved in long-term memory. Conclusion: We conclude that while predictions of the molecular function and biological process annotations have slightly improved over time, those of the cellular component have not. Term-centric prediction of experimental annotations remains equally challenging; although the performance of the top methods is significantly better than the expectations set by baseline methods in C. albicans and D. melanogaster, it leaves considerable room and need for improvement. Finally, we report that the CAFA community now involves a broad range of participants with expertise in bioinformatics, biological experimentation, biocuration, and bio-ontologies, working together to improve functional annotation, computational function prediction, and our ability to manage big data in the era of large experimental screens.

HAL-CentraleSupelec

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

REPISALUD

Archivio istituzionale della ricerca - Università di Padova

Helmholtz Zentrum für Infektionsforschung Repository

Central Archive at the University of Reading

AIR Universita degli studi di Milano

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Repository of the Vinča Nuclear Institute (VinaR)

OpenMETU (Middle East Technical University)

Explore Bristol Research

Deep Blue Documents at the University of Michigan

Archivio istituzionale della ricerca - Fondazione Edmund Mach

HAL Clermont Université

HAL Descartes

Helsingin yliopiston digitaalinen arkisto

Hal-Diderot

Hacettepe University Institutional Repository

Repository for Publications and Research Data

INRIA a CCSD electronic archive server

UCL Discovery

Kent Academic Repository

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens

Author: Aashish Jain
Adrian Altenhoff
Ahmet S. Rifaioglu
Alan J. Medlar
Alberto Paccanaro
Alessandro Petrini
Alex A. Freitas
Alex W. Crocker
Alex Warwick Vesztrocy
Alexandra J. Lee
Alexandre Renaux
Alfonso E. Romero
Alfredo Benso
Alice C. McHardy
Alperen Dalkıran
Angela Wilkins
Asa Ben-Hur
Ashton R. Omdahl
Balint Z. Kacsoh
Branislava Gemovic
Burkhard Rost
Caleb Chandler
Casey S. Greene
Castrense Savojardo
Cen Wan
Chenguang Zhao
Chengxin Zhang
Christine A. Orengo
Christophe Dessimoz
Claire O’Donovan
Constance J. Jeffery
Da Chen Emily Koo
Daisuke Kihara
Dallas J. Larsen
Damiano Piovesan
Dane Jo
Daniel B. Roche
Danielle A. Brackenridge
David T. Jones
David W. Ritchie
Deborah A. Hogan
Devon Johnson
Domenico Cozzetto
Ehsaneddin Asgari
Elaine Zosa
Enrico Lavezzo
Erica Suh
Fabio Fabris
Farrokh Mehryary
Feng Zhang
Filip Ginter
Florian Boecker
Fran Supek
Gage S. Black
George Georghiou
Gianfranco Politano
Giorgio Valentini
Giovanni Bosco
Giuliano Grossi
Giuseppe Profiti
Hafeez Ur Rehman
Hai Fang
Haixuan Yang
Hans Moen
Heiko Schoof
Huy N. Nguyen
Ian Sillitoe
Iddo Friedberg
Ilya Novikov
Imane Boudellioua
Indika Kahanda
Itamar Borukhov
Jari Björne
Jeffrey M. Yunes
Jia-Ming Chang
Jianlin Cheng
Jie Hou
Jonas Reeb
Jonathan B. Dayton
Jonathan Gill Lees
Jose Manuel Rodriguez
José M. Fernández
Julian Gough
Kai Hakala
Kimberley A. Lewis
Larry Davis
Liam J. McGuffin
Liisa Holm
Magdalena Antczak
Marco Carraro
Marco Falda
Marco Frasca
Marco Mesiti
Marco Notaro
Maria J. Martin
Marie-Dominique Devignes
Mark N. Wass
Martti E.E. Tolvanen
Mateo Torres
Matteo Re
Maxat Kulmanov
Md Nafiz Hamid
Meet Barot
Michael L. Tress
Michal Linial
Michele Berselli
Miguel Amezola
Mohammad R.K. Mofrad
Naihui Zhou
Natalie Thurlby
Neven Sumonja
Nevena Veljkovic
Olivier Lichtarge
Paolo Fontana
Patricia C. Babbitt
Peter L. Freddolino
Peter W. Rose
Petri Törönen
Pier Luigi Martelli
Po-Han Chi
Prajwal Bhat
Predrag Radivojac
Qizhong Mao
Rabie Saidi
Radoslav S. Davidović
Rebecca L. Hurto
Rengul Cetin Atalay
Renzhi Cao
Richard Bonneau
Rita Casadio
Robert Hoehndorf
Ronghui You
Rui Fa
Sabeur Aridhi
Saso Dzeroski
Sayoni Das
Sean D. Mooney
Seyed Ziaeddin Alborzi
Shanfeng Zhu
Shanshan Zhang
Shuwei Yao
Silvio C.E. Tosatto
Slobodan Vucetic
Stefano Di Carlo
Stefano Pascarelli
Stefano Toppo
Steven E. Brenner
Suwisa Kaewphan
Suyang Dai
Tapio Salakoski
Tatyana Goldberg
Timothy R. Bergquist
Tomislav Šmuc
Tunca Dogan
Vedrana Vidulin
Vladimir Gligorijević
Vladimir R. Perovic
Volkan Atalay
Wei-Cheng Tseng
Weidong Tian
Wen-Hung Liao
Yang Zhang
Yi-Wei Liu
Yotam Frank
Yuxiang Jiang
Zheng Wang
Zihan Zhang
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 27/10/2022

Background: The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function. Results: Here, we report on the results of the third CAFA challenge, CAFA3, that featured an expanded analysis over the previous CAFA rounds, both in terms of volume of data analyzed and the types of analysis performed. In a novel and major new development, computational predictions and assessment goals drove some of the experimental assays, resulting in new functional annotations for more than 1000 genes. Specifically, we performed experimental whole-genome mutation screening in Candida albicans and Pseudomonas aeruginosa genomes, which provided us with genome-wide experimental data for genes associated with biofilm formation and motility. We further performed targeted assays on selected genes in Drosophila melanogaster, which we suspected of being involved in long-term memory. Conclusion: We conclude that while predictions of the molecular function and biological process annotations have slightly improved over time, those of the cellular component have not. Term-centric prediction of experimental annotations remains equally challenging; although the performance of the top methods is significantly better than the expectations set by baseline methods in C. albicans and D. melanogaster, it leaves considerable room and need for improvement. Finally, we report that the CAFA community now involves a broad range of participants with expertise in bioinformatics, biological experimentation, biocuration, and bio-ontologies, working together to improve functional annotation, computational function prediction, and our ability to manage big data in the era of large experimental screens.

UTUPub

MicroPheno: predicting environments and host phenotypes from 16S rRNA gene sequencing using a k-mer based representation of shallow sub-samples.

Author: Asgari Ehsaneddin
Publication date: 04/11/2018

Ezid

Life Language Processing: Deep Learning-based Language-agnostic Processing of Proteomics, Genomics/Metagenomics, and Human Languages

Author: Asgari Ehsaneddin
Publication venue: eScholarship, University of California
Publication date: 01/01/2019

A broad and simple definition of 'language' is a set of sequences constructed from a finite set of symbols. By this definition, biological sequences, human languages, and many sequential phenomena that exist in the world can be viewed as languages. Although this definition is simple, it includes languages employing very complicated grammars in the creation of their sequences of symbols. Examples are biophysical principles governing biological sequences (e.g., DNA, RNA, and protein sequences), as well as grammars of human languages determining the structure of clauses and sentences. This dissertation uses a language-agnostic point of view in the processing of both biological sequences and human languages. Two main strategies are adopted toward this purpose: (i) character-level, or more accurately, subsequence-level processing of languages, which allows for simple modeling of sequence similarities based on local information, or bag-of-subsequences; and (ii) language model-based representation learning, which encodes contextual information of sequence elements using neural network language models. I propose language-agnostic and subsequence-based language processing using the above-mentioned strategies in addressing three main research problems in proteomics, genomics/metagenomics, and natural languages using the same point of view.
One of the main challenges in proteomics is that there exists a large gap between the number of known protein sequences and known protein structures/functions. The central question here is how to efficiently use large numbers of sequences to achieve better performance in the structural and functional annotation of protein sequences. Here, we proposed subsequence-based representations of protein sequences and their language model-based embeddings trained over a large dataset of protein sequences, which we called protein vectors (or ProtVec). In addition, we introduced a motif discovery approach, benefiting from probabilistic segmentation of protein sequences to find functional and structural motifs. This segmentation is also inferred from large protein sequence datasets. The ProtVec approach has proved a seminal contribution in protein informatics and is now widely used for machine learning-based protein structure and function annotation. We showed in different protein informatics tasks that bag-of-subsequences and protein embeddings provide complementary information for language-agnostic prediction of protein structures and functions, which also achieved state-of-the-art performance in 2 out of 3 tasks of the Critical Assessment of protein Function Annotation (CAFA) in 2018 (CAFA 3.14). Moreover, we systematically investigated the role of representation and deep learning architecture in protein secondary structure prediction from the primary sequence. Publicly available tools are provided for achieving state-of-the-art performance, which can be further expanded by the community.
One of the prominent challenges in metagenomics involves host phenotypic characterization based on the associated microbial samples. Microbial communities exist on almost every accessible surface on earth, supporting, regulating, and even causing unwanted conditions (e.g., diseases) in their hosts and environments. Detection of the host phenotype and the phenotype-specific taxa from the microbial samples is the chief goal here. For instance, identifying distinctive taxa for microbiome-related diseases is considered key to the establishment of diagnosis and therapy options in precision medicine and imposes high demands on the accuracy of microbiome analysis techniques. Here, we propose two distinct language-agnostic subsequence-based processing methods for machine learning on 16S rRNA sequencing, currently the most cost-effective approach for sequencing of microbial communities. We propose alignment- and reference-free methods, called MicroPheno and DiTaxa, designed for microbial phenotype and biomarker detection, respectively. MicroPheno is a k-mer based approach achieving state-of-the-art performance in host phenotype prediction from 16S rRNA, outperforming conventional OTU features. DiTaxa substitutes standard OTU clustering by segmenting 16S rRNA reads into the most frequent variable-length subsequences. We compared the performance of DiTaxa to the state-of-the-art methods in phenotype and biomarker detection, using human-associated 16S rRNA samples for periodontal disease, rheumatoid arthritis, and inflammatory bowel diseases, as well as a synthetic benchmark dataset. DiTaxa performed competitively with MicroPheno (the state-of-the-art approach) in phenotype prediction while outperforming the OTU-based state-of-the-art approach in finding biomarkers in both resolution and coverage, evaluated over known links from the literature and synthetic benchmark datasets.
The third central problem we addressed in this dissertation is focused on human languages. Many of the world's 7000 natural languages are low-resource and lack digitized linguistic resources. This has put many of these human languages in danger of extinction and has motivated the development of methods for automatic creation of linguistic resources and linguistic knowledge for low-resource languages. To address this problem via our language-agnostic point of view (by not treating different languages differently), we develop SuperPivot for subsequence-based linguistic marker detection in parallel corpora of 1000 languages, which was the first computational investigation for linguistic resource creation at such a scale. As an example, SuperPivot was used to study the typology of tense in 1000 languages. Next, we utilized SuperPivot for the creation of the largest sentiment lexicon to date in terms of the number of covered languages (1000+ languages), achieving macro-F1 over 0.75 on word sentiment prediction for most evaluated languages, meaning that we enable sentiment analysis in many low-resource languages. To ensure the usability of UniSent lexica for any new domain, we propose DomDrift, a method quantifying the semantic changes of words in the sentiment lexicon in the new domain. Next, we extend the DomDrift method to quantifying the semantic changes of all words in the language. We proposed a new metric for language comparison based on the language word embedding graphs, requiring only monolingual embeddings and word mapping between languages obtained through statistical alignment in parallel corpora. We performed language comparison for fifty natural languages and twelve genetic language variations of different organisms. As a result, natural languages of the same family were clustered together. In addition, applying the same method to organisms' genomes confirmed a high-level difference in the genetic language model of humans/animals versus plants. This method, called word embedding language divergence, is a step toward unsupervised or minimally supervised comparison of languages in their broad definition.
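
The dissertation abstract above describes MicroPheno as a k-mer based approach applied to 16S rRNA reads (the corresponding paper title mentions "shallow sub-samples"). The Python sketch below shows one plausible reading of that representation: a normalized k-mer frequency vector computed from a random subset of reads. It is a hedged illustration only. The function name `kmer_profile`, the parameters `k`, `sample_size`, and `seed`, and the toy reads are hypothetical and do not come from the MicroPheno software.

```python
import random
from collections import Counter
from itertools import product


def kmer_profile(reads, k, sample_size=None, seed=0):
    """Normalized k-mer frequency vector for a list of DNA reads.

    If sample_size is given and smaller than the number of reads, only that
    many randomly chosen reads are used (a "shallow sub-sample"). k-mers
    containing characters other than A, C, G, T are ignored.
    """
    if sample_size is not None and sample_size < len(reads):
        reads = random.Random(seed).sample(reads, sample_size)

    # Count every overlapping window of length k in every selected read.
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1

    # Fixed, lexicographically ordered vocabulary of all 4**k DNA k-mers,
    # so that vectors from different samples are directly comparable.
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in vocab)
    if total == 0:
        return [0.0] * len(vocab)
    return [counts[kmer] / total for kmer in vocab]


# Hypothetical toy reads standing in for one 16S rRNA sample.
reads = ["ACGTAC", "GGTACA", "TTACGG"]
profile = kmer_profile(reads, k=2)
shallow = kmer_profile(reads, k=2, sample_size=2, seed=1)
print(len(profile), len(shallow))
```

Vectors of this form could then be passed to any standard classifier to predict a host phenotype or environment label. The choice of classifier and of `k` is left open here because the abstract does not specify them.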

eScholarship - University of California

MicroPheno: predicting environments and host phenotypes from 16S rRNA gene sequencing using a k-mer based representation of shallow sub-samples

Author: Asgari Ehsaneddin
Publication date: 05/10/2020

Ezid

Replication Data for: Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics

Author: Asgari Ehsaneddin
Publication venue: Harvard Dataverse

Users should cite: Asgari E, Mofrad MRK. Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics. PLoS ONE 10(11): e0141287. doi:10.1371/journal.pone.0141287 (http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0141287). This archive also contains the family classification data that we used in the above-mentioned PLoS ONE paper. This data can be used as a benchmark for the family classification task.

Harvard Dataverse Network